ABSTRACT

This study attempted to determine the effectiveness
of cloze procedures as norm-referenced instruments by comparing the
differential responses of four groups of college students of English
as a second language on two identical cloze passages. The responses
were scored using both exact-answer and acceptable-word methods. The
results indicate that the effectiveness, measured as reliability and 
validity, appears to be strongly related to how well a given cloze 
passage fits a given student sample. This suggests, in turn, that 
pretesting any cloze passage is necessary so that an appropriate 
passage can be selected and modified or tailored to fit a certain 
group of students. Taking some or all of these steps should help 
produce a more reliable and valid norm-referenced instrument on whose 
scores responsible decisions about students can be based. (MSE) 

A Cloze is a Cloze is a Cloze?

James Dean Brown



INTRODUCTION 

In 1953, Taylor first discussed a procedure whereby some of the words in a
written text were replaced by blanks and students were required to fill them in. He
called this procedure "cloze" from the Gestalt psychology notion of closure, or the
human ability to fill gaps. Since then, there has been an explosion of research on
cloze when applied to native English speakers and, more recently, on its utility
among nonnative students of English (for overviews of this research, see Oller 1975;
Oller 1979: 340-80).

For native speakers, cloze was originally designed by Taylor as a measure of the
readability of texts. A great deal of work followed on this aspect of cloze (Oller 1979:
348-54). As an offshoot of this readability research, a number of studies have also
been produced on cloze procedure as a measure of native-speaker reading
comprehension ability (Brown 1978: 12-14). Criterion-related validity coefficients
were calculated between cloze and various standardized reading tests in these
studies. They ranged from .25 to .95. The squared values for these coefficients, .06 to
.90, indicate the percent of shared, or overlapping, variance between cloze and the
reading test in each study. It is safe to conclude from these results that cloze has been
shown to be both a very weak (6 percent) and highly valid (90 percent) test of reading
comprehension for native speakers, and almost everything in between as well.

For nonnatives, much of the work has been done on the value of cloze as a test
of overall second language proficiency. Often studies focus on one or both of the key
characteristics of a test: reliability and validity. For instance, studies have shown that
cloze can be fairly reliable, that is, it produces consistent results. Such studies have
indicated reliability indices ranging from .53 to .96 for various cloze passages
(Darnell 1970; Oller 1972b; Pike 1973; Jonz 1976; Alderson 1979; Mullen 1979;
Brown 1980; Hinofotis 1980; Brown 1983). Reliability coefficients can be interpreted
as the percentage of reliable (consistent) variance in a test. Thus, cloze passages have
been shown to have weak reliability (53 percent reliable variance) as well as high
reliability (96 percent reliable variance), and almost everything in between as well.

The validity of cloze in second language situations has also been investigated
(Conrad 1970; Darnell 1970; Oller and Inal 1971; Oller 1972a and b; Irvine et al.
1974; Stubbs and Tucker 1974; Alderson 1979 and 1980; Brown 1980; Hinofotis
1980; Mullen 1979). Validity is defined as the degree to which a test measures what it
claims to be measuring, in this case overall second language proficiency.
Generally, this has been demonstrated (as criterion-related validity) by showing the
strength of association between scores on a cloze test and those on a standardized
language placement or proficiency examination. Coefficients of .43 to .91 have been
reported in these studies. And again, the squared values of these coefficients, .19 to
.83, indicate the percent of shared, or overlapping, variance between a given cloze test
and the criterion measure. Hence, cloze has been shown to be a weak (19 percent)
measure of overall language proficiency, as well as a fairly strong one (83 percent),
and almost everything in between as well.

It appears, then, that the results of studies on the reliability and validity of cloze 
procedure have varied greatly over the years. And in all fairness, it should be pointed 
out that investigators were changing cloze in the following ways within and between 
studies: 

1) seven different scoring methods have been used

2) numerous deletion patterns have been tried 

3) blank lengths have been modified 

4) passage difficulties have been varied 

5) test length has been changed 

6) and a variety of different samples have been used. 

These variables have been manipulated, consciously and unconsciously, in 
search of more effective ways to construct and interpret cloze tests. Generally 
speaking, variables one through five above have been purposefully manipulated or 
controlled in the second-language research. Variable six, the effect of different 
samples, has not been investigated sufficiently, which seems strange given that cloze 
procedure was originally shown to be very sample sensitive— so sensitive that 
readability grade levels could be established by using it (Taylor 1953). 

In fact, sampling is an important consideration in many second language 
studies. After all, it is simple common sense that a sample of nonnative students 
taken at a university in Great Britain may be quite different from one taken at UCLA 
or in Papua New Guinea. Just such differences in samples exist in the studies cited 
above and this variable alone may have much to do with the wide variety of results. 
For example, Ebel (1979: 290-91) has pointed out that the reliability of a set of test
scores depends in part on the "range of talent" in the group tested. In fact, restrictions
in the range of talent can depress both reliability and validity coefficients in general
(Shavelson 1981).
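
The depressing effect of a restricted range of talent on correlation coefficients can be illustrated with a small simulation. The sketch below is not part of the original study: the simulated ability scale, the amount of measurement noise and the cut-offs that define the narrow group are all assumptions chosen only for illustration.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical examinees: one underlying ability, two noisy measures of it.
ability = [rng.gauss(0, 1) for _ in range(5000)]
cloze = [a + rng.gauss(0, 0.5) for a in ability]
criterion = [a + rng.gauss(0, 0.5) for a in ability]

# Correlation over the full range of talent.
full_r = pearson(cloze, criterion)

# Same measures, but only for examinees in a narrow band of ability.
narrow = [(c, k) for a, c, k in zip(ability, cloze, criterion) if -0.5 < a < 0.5]
narrow_r = pearson([c for c, _ in narrow], [k for _, k in narrow])

print(f"full range:   r = {full_r:.2f}")
print(f"narrow range: r = {narrow_r:.2f}")
```

Under these assumptions the correlation in the narrow group comes out well below the full-sample value, even though the two measures are related in exactly the same way in both groups.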

The purpose of this study, then, is to investigate the effects of differences in 
samples on cloze test results by addressing the following more specific research 
questions: 

1) What are the effects of different ranges of talent on the apparent reliability 
and validity of cloze? 

2) What is the strength of relationship between ranges of talent and the 
reliability and validity coefficients? 

3) Do the results generalize to other cloze studies?
METHOD

Subjects

The samples in this study were all randomly selected (to be approximately equal
in size) from larger university level populations and consisted of four groups which
will be labeled as follows: 1) 1978 sample, 2) 1981 sample, 3) Winter 1982 sample and
4) Spring 1982 sample. The four samples (described in Table 1) differed in many
ways, but it is particularly important to notice the way they differed in terms of 
estimated TOEFL score ranges (see last column). From these estimates, it is clear 
that the groups differed considerably in the ranges of talent represented in each. 

Materials

The cloze passage under investigation here was adapted from Man and His
World (Kurilecz 1969), an intermediate ESL reader. The passage was 399 words
long and had an every 7th word deletion pattern for a total of fifty blanks. To
provide context, two sentences were left intact (that is, without blanks) at the
beginning of the passage and one at the end.

The measures used to calculate the criterion-related validity coefficients were all
standardized (norm-referenced) English language placement or proficiency tests.
They differed from sample to sample as follows: 1978 sample, UCLA English as a
Second Language Placement Examination (ESLPE) (including listening and
reading comprehension, dictation and structure subtests); 1981 sample, Guangzhou
English Language Center (GELC) Placement Test (including listening and
reading comprehension, as well as writing and structure subtests); Winter 1982
sample, Test of English as a Second Language Practice Kit Number I (including
listening, structure and written expression, as well as reading comprehension and
vocabulary); Spring 1982 sample, UCLA ESLPE (a shorter version of the ESLPE
used above without the dictation subtest).

Procedures

Exactly the same cloze passage was administered to each of the four samples 
and no more than two weeks separated its administration from that of the validity 
criterion measure. The cloze test was scored using two scoring methods: the exact- 
answer method (EX), wherein only the word found in the original passage is counted 
correct, and the acceptable-word method (AC), wherein any word acceptable to 
native speakers is counted correct. The latter method was based on the responses of 
77 UCLA freshman composition students (Brown 1978). 

Analyses

The descriptive test statistics in this study include the mean (X̄), standard
deviation (S) and range. Cronbach alpha (r_tt) internal consistency reliabilities are
also given along with criterion-related validity coefficients (r_xy). The latter were
calculated by determining the correlation between the cloze tests and the criterion
measure in question. All correlation coefficients reported in this study are Pearson
product-moment coefficients.
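
As an illustration of the internal consistency statistic named above, the following sketch computes Cronbach alpha from a matrix of item scores. It is a minimal example and not the author's own computation: it assumes each blank is scored 0 or 1, uses population variances throughout, and the function name and the tiny data set are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach alpha for a list of examinee rows, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    Requires at least two items and some variation in the total scores.
    """
    k = len(scores[0])
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 5 examinees by 4 cloze blanks (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

Because the variance of the total scores sits in the denominator, a group with a narrow spread of totals will tend to produce a lower alpha for the same items, which is the effect examined in this study.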

Table 1: Sample Descriptions

                         Sex                Academic Status
Sample       Place  n    M     F            Extension    Undergraduate  Graduate  Nationalities     Major                 Estimated TOEFL Range
1978         UCLA   55   45%   [illegible]  [illegible]  36%            46%       Numerous          Numerous              very wide
                                                                                  (See Brown 1978)  (See Brown 1978)      (range = 1500)
1981         GELC   45   93%   7%           0            0              100%      Chinese           Engineering (100%)    259-578 (range = 319)
Winter 1982  GELC   45   78%   22%          0            0              100%      Chinese           Biochemistry (38%)    440-600 (range = 160)
                                                                                                    Chemistry (33%)
                                                                                                    Biology (29%)
Spring 1982  GELC   45   80%   20%          0            0              100%      Chinese           Engineering (44%)     435-515 (range = 80)
                                                                                                    Agriculture (16%)
                                                                                                    Other Sciences (33%)

Fisher z transformations were used whenever correlation coefficients were
compared with standard deviations in order to correct for the non-symmetrical
distribution of such coefficients. In general, this is necessary in order to draw correct
inferences about sample correlation coefficients which are not near zero (Guilford
and Fruchter 1973: 144-46).
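
The Fisher z transformation mentioned here is the inverse hyperbolic tangent of r. The sketch below shows the transformation and its inverse using Python's standard `math` module; the function names are hypothetical, and the coefficients in the example are simply the lowest and highest reliability values cited in the introduction, used here to show how the z scale stretches differences near the top of the range.

```python
import math


def fisher_z(r):
    """Fisher z transformation: z = 0.5 * ln((1 + r) / (1 - r)) = atanh(r)."""
    return math.atanh(r)


def inverse_fisher_z(z):
    """Convert a z value back to the correlation scale."""
    return math.tanh(z)


# Equal steps in r are not equal steps in z: a difference of .06 near the
# top of the scale is much larger on the z scale than the same difference
# near the middle.
for low, high in [(0.53, 0.59), (0.90, 0.96)]:
    gap = fisher_z(high) - fisher_z(low)
    print(f"r = {low:.2f} to {high:.2f}: difference in z = {gap:.3f}")
```

Working on the z scale puts coefficients on a metric with an approximately symmetrical sampling distribution before they are related to other statistics such as standard deviations.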



RESULTS 

Descriptive test characteristics are reported in Table 2 for the cloze test 
administered to the four different samples. These are the four samples described 
above. Remember that they were quite different in ranges of talent. These differences 
were also reflected on the cloze test in terms of test ranges (rows four and ten) and 
perhaps more accurately in the standard deviations (rows two and eight). 



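[Table 2: descriptive test characteristics for the four samples, by scoring method; the table body is not recoverable.]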

The Effects of Different Ranges of Talent on the Reliability and Validity of Cloze.

In Table 3, the results are rearranged to illustrate that the reliability and validity
coefficients decrease when the range of talent (as represented by standard deviation
and test range) decreases. The deletion pattern, blank length, passage difficulty, test
length and time allowed for the test were all held constant here while the sample
range was systematically varied. The results indicate that a relationship exists
between range of talent and the various reliability and validity coefficients.

Another way of looking at this problem is to adjust the observed reliability
coefficients for homogeneity of variances, or restrictions in range (after Magnusson
1967: 75). When this is done, it turns out that the adjusted reliability coefficients are
all between .95 and .96. Thus, the reliability coefficients would be virtually the same
for all of the samples if it were not for the differences in variance. In short, the results
here demonstrate that restrictions in range of talent do indeed depress the reliability
and validity coefficients, consistent with psychometric theory.
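
The text cites Magnusson (1967: 75) for this adjustment but does not reproduce the formula. The sketch below assumes the common textbook correction for restriction in range, which treats the error variance as the same in the restricted and unrestricted groups; whether this is exactly the form used in the study is an assumption. The function name and all numbers in the example are hypothetical.

```python
def adjust_reliability(r_restricted, sd_restricted, sd_unrestricted):
    """Estimate reliability in a wider-range group from a restricted group.

    Assumes error variance is constant across groups:
        error variance = sd_restricted**2 * (1 - r_restricted)
        adjusted r     = 1 - error variance / sd_unrestricted**2
    """
    error_variance = sd_restricted ** 2 * (1 - r_restricted)
    return 1 - error_variance / sd_unrestricted ** 2


# Hypothetical values: a homogeneous sample (SD = 4) with low observed
# reliability, projected onto a heterogeneous sample (SD = 12).
adjusted = adjust_reliability(r_restricted=0.40, sd_restricted=4.0, sd_unrestricted=12.0)
print(f"adjusted reliability = {adjusted:.2f}")
```

When the two standard deviations are equal the function returns the observed coefficient unchanged, so any difference between observed and adjusted values comes entirely from the difference in variance.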






Table 3: Ranges of Talent in Relationship to Reliability and Validity of Cloze

The Strength of Relationship between Ranges of Talent and the Reliability and Validity.

To evaluate the strength of association between range of talent, and reliability
and validity coefficients, correlational analysis was performed. The correlation
between standard deviations and reliability coefficients was found to be r = .97 for
the two scoring methods combined. This indicates that about 93 percent (r²) of the
variation in reliability coefficients can be accounted for by knowing the standard
deviations. Likewise, the strength of association between the standard deviations
and validity coefficients was found to be r = .93 (r² = .86). In other words, the
standard deviation seems to account for about 86 percent of the variation in validity
coefficients.

In short, the results here indicate that variations in sample range, whether
generated by the sample itself or the scoring method employed, strongly account for
differences in the reliability and validity coefficients. This effect is so great that,
depending on the sample and scoring method used, this cloze passage may appear to
be one of the best passages ever reported (r_tt = .90; r_xy = .95 for AC 1978) or a
hands-down loser, one of the worst (r_tt = .31; r_xy = .43 for EX Spring 1982).

Generalizability of the Results to Other Cloze Studies.

In answering this question, only those studies which provided clear and
complete information (that is, standard deviation, reliability and validity
coefficients) could be considered. In addition, only those based on 50-item passages
scored by the EX and AC methods were included. Forty different sets
of results are presented in Table 4. The correlation between the standard deviations
and the reliability estimates throughout Table 4 was found to be .91. The squared
value of this coefficient, .83, indicates that about 83 percent of the variation in
reliability coefficients is explained by variation in the magnitude of the standard
deviations. Likewise, the correlation between the standard deviations and the
validity coefficients was .78, which shows that approximately 61 percent of the
variation in validity coefficients is explained by variation in the standard deviations.
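
The step from a correlation coefficient to a percentage of explained variation is simply squaring. The short sketch below repeats that arithmetic for the two coefficients reported for Table 4; the function name is hypothetical.

```python
def shared_variance(r):
    """Proportion of variance shared by two measures with correlation r."""
    return r ** 2


# Coefficients reported in the text for the studies summarized in Table 4.
reported = [
    ("SD vs. reliability", 0.91),
    ("SD vs. validity", 0.78),
]
for label, r in reported:
    print(f"{label}: r = {r:.2f}, r squared = {shared_variance(r):.2f}")
```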






Notice that both of these relationships were found here even though five different
deletion patterns and scoring methods were combined.

In summary, the cloze literature to date indicates that cloze may or may not be
highly reliable and valid as a norm-referenced test of overall second language
proficiency. The results here indicate that this may be largely due to differences in the
way a given cloze passage relates to a given sample. This is consistent with
psychometric theory and apparently is a factor in other cloze studies.

*Coefficient does not seem to fit the ordering.

DISCUSSION 

It should be emphasized that cloze is being viewed here as a norm-referenced
test for purposes of placement or proficiency testing in ESL/EFL programs. Thus,
the statistical concepts of reliability, validity, etc. are important considerations
though they may seem a bit tedious to the hardworking teachers/administrators in
the field. To make these results more relevant to those very teachers, both theoretical
and practical implications will be discussed here.

Theoretical Implications

To the language testing specialist, the results here may seem obvious, based on 
knowledge of psychometric theory, to the point of being uninteresting. It may be, 
however, that the obvious has been overlooked in favor of the fashionable. Put in 
more scientific terms, the most parsimonious explanations of the phenomena we are 
observing in cloze testing may be found in the psychometric theory and statistical 
techniques being used. Or, the tools themselves may hold the clues to clear
interpretations of the data. 

Let us take for example a rather naive study (Brown 1980), the author of which
will most definitely not sue for libel. In this study, four scoring methods were
compared on the basis of reliability coefficients (ranging from .89 to .95), validity
coefficients (ranging from .88 to .91) and other test characteristics. One conclusion
drawn was that "the best overall scoring method is the AC method" (p. 316). While
this conclusion seemed reasonable at the time based on previous research,
information was available in that study which should have been examined. For
instance, the AC scoring method was nearly perfectly centered for the given sample
(X̄ = 25.58 out of 50) and was the only scoring method for which the subjects were
normally distributed (with the highest standard deviation of 12.43). The other three
scoring methods produced distributions which were either negatively or positively
skewed for the particular samples in question with correspondingly lower standard
deviations. In addition, the same cloze passage administered to other samples in
China has here been shown to have entirely different distributions in each of the
samples with corresponding differences in the reliability and validity coefficients
produced.

In short, the results obtained in Brown (1980) might have been quite different 
had intuition and good luck not guided the researcher to the particular passage and 
sample of subjects involved. Therefore, a more parsimonious and sensible 
hypothesis for differences in reliability and validity for different scoring methods (or
deletion patterns, difficulty levels, etc.) might be that adjusting any and all variables
which help to make a given cloze passage more appropriate for a given sample will 
correspondingly help to produce a test which is statistically more reliable and valid. 

Furthermore, it appears that cloze is not necessarily a reliable, valid and easy to
develop test of overall second language proficiency as is often believed (for example,
Soudek and Soudek 1983). In fact, it is probably erroneous to say that cloze is
anything; rather, it would be safer to take the position that cloze tests are a "family of
item types" (Mullen 1979) which can tap the wide range in the universe of possible
language proficiency items (at least in the receptive/productive modes on written
material).

It cannot be taken as a foregone conclusion that a given cloze test will be reliable
and valid for a given sample because it would be a rare sample whose abilities
spanned the entire range of possible items. Nevertheless, it is necessary to make
decisions within samples that are more or less narrow in terms of ranges of talent.
Therefore, it would seem that a cloze test should be made to fit a particular sample if
decisions based on the results are to be responsible. This last necessity may preclude
the notion that cloze tests are easy to develop.



Practical Implications

How can a cloze test be made to fit a given sample? First and foremost, cloze
tests should be pretested like any other language tests so that the results can
eventually provide clear interpretations. To this end, cloze items can be
selected/fitted to a given sample in one of three ways: 1) the hit or miss method, 2) the
modification method or 3) the well-tailored cloze method.

The hit or miss method. This shotgun approach to test development would involve
selecting a relatively large number of texts, deleting every nth word and
administering all of them to a sample of students representative of the group about which
decisions would ultimately be made. After analyzing the results, that cloze test which
seemed to produce the best distribution of scores could be selected for later decision
making. In other words, the cloze passage which seemed to best center the sample
(that is, produced a mean of about 50% correct) and which appeared most sensitive
to the range of talent in that sample (that is, produced a high standard deviation)
could be selected for later use with the entire group.
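
The selection rule described for the hit or miss method can be sketched as a short routine. This is only one possible reading of "best distribution": the 10-point tolerance around 50% correct, the function names and the pilot scores are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def summarize(scores, n_items):
    """Mean as a percentage of items, plus the standard deviation of raw scores."""
    return {"mean_pct": 100 * mean(scores) / n_items, "sd": pstdev(scores)}


def pick_passage(results, n_items=50, tolerance=10.0):
    """Pick the passage that centers the sample and spreads it out most.

    results maps a passage name to the raw scores of the pilot sample.
    Passages whose mean is within `tolerance` points of 50% correct are
    preferred; among those, the one with the highest SD is chosen. If no
    passage is centered, all passages are compared on SD alone.
    """
    stats = {name: summarize(scores, n_items) for name, scores in results.items()}
    centered = {
        name: s for name, s in stats.items() if abs(s["mean_pct"] - 50) <= tolerance
    }
    pool = centered or stats
    return max(pool, key=lambda name: pool[name]["sd"])


# Hypothetical pilot results for three 50-item passages.
pilot_results = {
    "passage_a": [10, 12, 11, 13, 12],  # too difficult for this group
    "passage_b": [15, 20, 25, 30, 35],  # centered and well spread
    "passage_c": [24, 25, 26, 25, 25],  # centered but barely spread
}
chosen = pick_passage(pilot_results)
print(chosen)
```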

The modification method. To adopt this method, one cloze passage, which was
thought to be intuitively about the right level for the group, could be developed and
administered to a sample representative of the larger group. After analyzing the
results using the EX scoring method, modifications could be made consistent with
what has been found in the literature to date. For instance, if the cloze test in
question was found to be much too difficult for the group (for example, produced a
mean of 25% correct), it seems likely that lengthening the passage and increasing the
distance between the blanks (from say every 7th word to every 11th word) would
help to better center the scores. Alternatively, the mean could be somewhat
artificially increased by using the AC scoring method. Using the AC method has also
been shown to produce higher standard deviations in many but not all studies. The
modified passage should then be readministered and reanalyzed to see that the
desired effects had occurred and that the passage indeed fit the entire group.

The well-tailored cloze. It has been shown (Brown Unpublished ms.) that
traditional test development techniques can be applied to a cloze test to increase the
reliability of that instrument. Five different, but non-overlapping, every 7th word
deletion pattern versions of one passage (50 items each) were administered to
random samples of a group of Chinese students who had a very narrow range of
talent. Analysis of the results produced item difficulty and discrimination indices for
a pool of 250 possible items. From these items the best 50 were selected. In other
words, those which had item difficulty levels most closely approximating .50 and the
highest discrimination indices were chosen. One restriction was placed on this
selection process. The distance between items on the final version was to be no less
than five words and no more than nine with an average of seven words. The new
version of the test was then readministered to the same group after six weeks (to
avoid testing effect) and found to be much more reliable than the original version
with this same group. These results suggest that a cloze test can be tailored to fit a
given group in much the same way that discrete-point tests have traditionally been
developed (though perhaps without the same precision because of the differences in
the context provided in the various versions involved).
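
The item analysis behind the well-tailored cloze can be sketched as follows. The text does not say which discrimination index was used, so the upper-group minus lower-group index below is an assumption, as is the choice to rank items first by discrimination and then by closeness of difficulty to .50. The spacing restriction (five to nine words between blanks) is not implemented, and all names and data are hypothetical.

```python
def item_facility(responses, item):
    """Item difficulty: proportion of examinees answering the item correctly."""
    return sum(row[item] for row in responses) / len(responses)


def item_discrimination(responses, item, groups=3):
    """Facility in the top-scoring group minus facility in the bottom group.

    Examinees are ranked by total score and split into `groups` parts;
    only the top and bottom parts are compared.
    """
    ranked = sorted(responses, key=sum)
    n = max(1, len(ranked) // groups)
    lower, upper = ranked[:n], ranked[-n:]
    return item_facility(upper, item) - item_facility(lower, item)


def best_items(responses, how_many):
    """Indices of the items with the highest discrimination, ties broken by
    how close the item difficulty is to .50, returned in passage order."""
    def rank_key(item):
        closeness = abs(item_facility(responses, item) - 0.5)
        return (-item_discrimination(responses, item), closeness)

    n_items = len(responses[0])
    return sorted(sorted(range(n_items), key=rank_key)[:how_many])


# Hypothetical pilot data: 6 examinees by 4 blanks (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
print(best_items(responses, 2))
```

In this toy data the first blank is answered correctly by everyone, so it cannot separate stronger from weaker examinees and is not selected.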

Returning to the title of this study, and the overall question involved, it appears
that a cloze is not a cloze is not a cloze. In fact, they appear to differ quite widely in
effectiveness as norm-referenced instruments. This effectiveness, in terms of
reliability and validity, appears to be strongly related to how well a given cloze passage
fits a given sample. Therefore, pretesting any cloze passage(s) seems absolutely
essential so that an appropriate passage can be selected, modifications can be made or
a passage can be tailored to fit a particular group of students. Taking some or all of
these steps should help to produce a more highly reliable and valid norm-referenced
instrument. Only then can adequately responsible decisions be based on the scores of
our students on such a test.